A little while ago a colleague at a company I was working with took a keen interest in one of my data science projects, which used linear regression to make probabilistic predictions about student retention (i.e. whether students are likely to complete their courses or drop out before completing).
My colleague asked me how the result of all this data science work could be extracted out of the Python Machine Learning algorithm and deployed into the production environment (which is not written in Python) so that users of the core production systems could spot students who are likely to drop out and then give them additional support to help them succeed.
That got me thinking about how I could extract the machine learning algorithm from Python and encode it in a different programming language (for example Excel VBA, C#.NET etc.), which led to this article.
As the student data is private, I have used a public data set from Kaggle relating to the prediction of house sale prices. The full Kaggle data set can be found here.
Let's start by importing the libraries we will need for the rest of the Notebook and by reading in the data which has been downloaded from Kaggle and placed in a sub directory called data.
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import seaborn as sns
Note that the data has already been through the data science pipeline. It has been cleaned, explored, modelled, tested, validated and interpreted and these steps are outside the scope of this article.
We are assuming that the data and the model are sound and that we just want to transfer the machine learning algorithm into a production environment.
Here is what our final data set looks like -
housing = pd.read_csv("data/housing_prices_cleaned.csv")
housing.head()
| | OverallQual | TotalBsmtSF | CentralAir | GrLivArea | GarageCars | SalePrice | G_Fin | G_RFn | G_Unf | PV_N | PV_P | PV_Y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 856 | 1 | 1710 | 2 | 208500 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 6 | 1262 | 1 | 1262 | 2 | 181500 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 7 | 920 | 1 | 1786 | 2 | 223500 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 7 | 756 | 1 | 1717 | 3 | 140000 | 0 | 0 | 1 | 0 | 0 | 1 |
| 4 | 8 | 1145 | 1 | 2198 | 3 | 250000 | 0 | 1 | 0 | 0 | 0 | 1 |
To avoid making the data, and hence the article, over-complicated I have chosen to drop some of the features that the previous stages of the data science pipeline concluded were only making a marginal contribution to the prediction algorithm -
housing = housing.drop(columns=['CentralAir'])
housing = housing.drop(columns=['G_Fin', 'G_RFn', 'G_Unf'])
housing = housing.drop(columns=['PV_N', 'PV_P', 'PV_Y'])
The final data set we will be working with is as follows -
housing.head()
| | OverallQual | TotalBsmtSF | GrLivArea | GarageCars | SalePrice |
|---|---|---|---|---|---|
| 0 | 7 | 856 | 1710 | 2 | 208500 |
| 1 | 6 | 1262 | 1262 | 2 | 181500 |
| 2 | 7 | 920 | 1786 | 2 | 223500 |
| 3 | 7 | 756 | 1717 | 3 | 140000 |
| 4 | 8 | 1145 | 2198 | 3 | 250000 |
Although the modelling stage is out of scope for this article, I did think it would be worth spending one line of code to have a look at the shape of the data. The following pairplot shows how each feature is distributed and confirms that the remaining features are correlated with the sale price, which is our target / dependent variable.
sns.pairplot(housing, kind='reg', diag_kind='hist')
plt.show()
The machine learning model was built in a previous stage of the data science pipeline, but it is going to be re-built here as we need the LinearRegression object to build our deployed algorithm.
Here we are splitting the data into two sets, 70% of which will be used to train the model and 30% to test the model and then fitting the linear regression model to the training data set ...
# Separate out the features into X and the target into y
X = housing.drop(['SalePrice'], axis = 'columns')
y = housing['SalePrice']
# Now split our data - 70% into the training dataset and 30% into the testing dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1111)
lr = LinearRegression() # Instantiate a LinearRegression object from the sklearn.linear_model library
# This odd-looking code works around an intermittent numerical issue.
# When the code is run by executing the entire Notebook the following error is sometimes raised -
# LinAlgError: SVD did not converge in Linear Least Squares
# The problem can be avoided by simply retrying the fit
try:
    lr.fit(X_train, y_train)
except Exception:
    lr.fit(X_train, y_train)
Again, the accuracy of the model would have been established in the "Test and Validate Model" stage of the data science pipeline but let's just have a quick look to make sure the model is performing as expected ...
print(f"r squared = {lr.score(X_train, y_train)}")
def adjusted_r_squared(X_train, y_train):
    r2 = lr.score(X_train, y_train)
    adj_r2 = 1 - (1 - r2) * ((X_train.shape[0] - 1) / (X_train.shape[0] - X_train.shape[1] - 1))
    return adj_r2

print(f"adjusted r squared = {adjusted_r_squared(X_train, y_train)}")
r squared = 0.746441600427625
adjusted r squared = 0.7454443205866323
Note: R-squared is a statistical measure of how close the actual values of y are to the fitted regression line; in our case the model achieves an R-squared of 74.6%.
However, R-squared suffers from a disadvantage: its value can only increase as more features are added. Hence the "Adjusted R-squared" is a better measure, as it takes account of the number of features: adding features that contribute little can actually reduce the adjusted R-squared value.
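For reference, the adjusted R-squared computed above can be written as -

$$\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$$

... where $n$ is the number of rows in the training data and $p$ is the number of features, which is exactly what the adjusted_r_squared function calculates.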
Our Adjusted R-squared is 74.5% and this is good enough for our purposes, so let's continue.
The next line of code generates the predicted sales prices using the model and stores the results in y_pred -
y_pred = lr.predict(X_test)
y_pred
array([162748.54740409, 179918.69651808,  56278.07271446, 200019.80456027,
       124221.13505462, 188290.80972954, 187767.51786476, 236353.92910001,
       ...
       192567.52812198, 208778.88142787, 250466.37269472, 189913.54725195,
       118243.29243301, 190802.45734178])
Now let's take a copy of the testing data, add the actual price back in and also add in the predicted price that was generated by our linear regression machine learning algorithm ...
X_test_copy = X_test.copy()
X_test_copy['ActualPrice'] = y_test
X_test_copy['PredictedPrice'] = y_pred
X_test_copy.head()
| | OverallQual | TotalBsmtSF | GrLivArea | GarageCars | ActualPrice | PredictedPrice |
|---|---|---|---|---|---|---|
| 1164 | 5 | 1360 | 1432 | 2 | 194000 | 162748.547404 |
| 860 | 7 | 912 | 1426 | 1 | 189950 | 179918.696518 |
| 520 | 4 | 0 | 1294 | 0 | 106250 | 56278.072714 |
| 219 | 7 | 1248 | 1248 | 2 | 167240 | 200019.804560 |
| 1025 | 5 | 882 | 882 | 2 | 112500 | 124221.135055 |
It all looks as expected. Now we can move onto the main objective which is to find a way to "productionize" the algorithm and to deploy it.
Before we do that, let's quickly review the linear regression algorithm and what it means ...
A linear regression generates the straight line that best fits the relationship between the independent variable (x axis) and the dependent variable (y axis) -
Image by author
In a linear relationship between two variables like the one in the image the value of y can be predicted using a linear equation as follows -
$$y = mx + b$$

... where $m$ is the co-efficient of $x$ and $b$ is the intercept.
Our model has 4 features and it can be represented as a linear equation with 4 co-efficients as follows -
$$y = ma + nb + oc + pd + e$$

... where a, b, c and d are our data values, m, n, o and p are the co-efficients we must multiply them by to predict the answer and e is the intercept.
Once the model has been built, the LinearRegression object can be used to retrieve the co-efficients and the intercept of our linear equation as follows -
print(f"coefficients: {lr.coef_}")
print(f"intercept: {lr.intercept_}")
coefficients: [24408.26533016    28.04885569    45.67283515 18806.45718838]
intercept: -100455.63728874002
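If it helps to see this mechanically, here is a tiny self-contained sketch (using made-up numbers rather than the housing data, so the variable names like demo_lr are illustrative only) showing that a linear model's prediction is just the dot product of the co-efficients with the feature values, plus the intercept -

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny synthetic data set (not the housing data): y = 3a + 2b + 5 exactly
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 4.0], [4.0, 1.0]])
y = 3 * X[:, 0] + 2 * X[:, 1] + 5

demo_lr = LinearRegression().fit(X, y)

# A prediction is the dot product of the co-efficients with the
# feature values, plus the intercept
row = np.array([2.5, 3.0])
manual = np.dot(demo_lr.coef_, row) + demo_lr.intercept_
model = demo_lr.predict(row.reshape(1, -1))[0]

print(manual, model)  # the two values agree to floating point precision
```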
If that is still a bit difficult to visualise the following Python function will create a simple DataFrame from the output of the model that we can use to easily see and use the output of the algorithm -
def coefficient_dataframe(lr, X_train):
    df_coef = pd.DataFrame(data=lr.coef_, index=X_train.columns)  # Create the DataFrame
    df_coef.columns = ['Coefficient']            # Rename the only column to be "Coefficient"
    df_coef.loc['intercept_'] = [lr.intercept_]  # Add in the intercept
    df_coef['StdDev'] = 0.0                      # Create a column to store the standard deviation
    for index, row in df_coef.iterrows():        # Add the standard deviation of the training dataset for each column
        if index != 'intercept_':
            df_coef.loc[index, 'StdDev'] = X_train[index].std()
    df_coef.index.rename('Feature', inplace=True)  # Rename the index
    return df_coef                                 # Return the completed data set
df_coef = coefficient_dataframe(lr, X_train)
df_coef
| Feature | Coefficient | StdDev |
|---|---|---|
| OverallQual | 24408.265330 | 1.380430 |
| TotalBsmtSF | 28.048856 | 451.799887 |
| GrLivArea | 45.672835 | 539.122692 |
| GarageCars | 18806.457188 | 0.751555 |
| intercept_ | -100455.637289 | 0.000000 |
Note that I have included the standard deviation because if you want to compare the co-efficients, each must be multiplied by its feature's standard deviation so that they are expressed in comparable units, as the following example illustrates -
# This plot shows how much each feature is contributing to the model output
df_plot = df_coef.copy()
df_plot.drop(['intercept_'], axis = 'rows', inplace=True)
df_plot['CoefTimesStddev'] = df_plot['Coefficient'] * df_plot['StdDev']
df_plot.sort_values(by='CoefTimesStddev',ascending=True)['CoefTimesStddev'].plot(kind='barh', figsize=(9, 7))
plt.title('Contribution of Co-efficients')
plt.show()
Image by author
Now we have everything ready and we can deploy the linear regression represented by our model in a few easy steps.
First let's have another quick look at the test data including the actual price and the predicted price ...
X_test_copy.head()
| | OverallQual | TotalBsmtSF | GrLivArea | GarageCars | ActualPrice | PredictedPrice |
|---|---|---|---|---|---|---|
| 1164 | 5 | 1360 | 1432 | 2 | 194000 | 162748.547404 |
| 860 | 7 | 912 | 1426 | 1 | 189950 | 179918.696518 |
| 520 | 4 | 0 | 1294 | 0 | 106250 | 56278.072714 |
| 219 | 7 | 1248 | 1248 | 2 | 167240 | 200019.804560 |
| 1025 | 5 | 882 | 882 | 2 | 112500 | 124221.135055 |
... and also quickly review our co-efficient and intercept values -
df_coef
| Feature | Coefficient | StdDev |
|---|---|---|
| OverallQual | 24408.265330 | 1.380430 |
| TotalBsmtSF | 28.048856 | 451.799887 |
| GrLivArea | 45.672835 | 539.122692 |
| GarageCars | 18806.457188 | 0.751555 |
| intercept_ | -100455.637289 | 0.000000 |
Now let's have a look at the values of the 4 features for the top row in our test data ...
X_test_copy.loc[1164]
OverallQual            5.000000
TotalBsmtSF         1360.000000
GrLivArea           1432.000000
GarageCars             2.000000
ActualPrice       194000.000000
PredictedPrice    162748.547404
Name: 1164, dtype: float64
To "productionize" our linear regression algorithm all we have to do is multiply the value of each feature in our chosen row of data by its associated co-efficient and add the intercept value as follows -
calc_price = -100455.637289 + (24408.265330 * 5) + (28.048856 * 1360) + (45.672835 * 1432) + (18806.457188 * 2)
calc_price
162748.54761699997
This shows that the sale price for the row with index 1164 was predicted to be 162748.547404 by the linear regression machine learning model, while our simple linear equation using the co-efficient and intercept values gave 162748.54761699997.
Our simple linear equation matched the model output to 3 decimal places; the tiny difference arises because we copied rounded co-efficient values, and it is certainly close enough to implement in our production environment.
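As a quick sanity check that the discrepancy really is just a rounding artefact, here is a small sketch comparing the co-efficients as printed by the model with the same values rounded to six decimal places for hand-copying (the numbers are the ones from the output above) -

```python
import numpy as np

# Co-efficients as printed by the model vs the rounded values
# copied into the manual calculation
full = np.array([24408.26533016, 28.04885569, 45.67283515, 18806.45718838])
rounded = np.round(full, 6)
intercept = -100455.63728874002

features = np.array([5, 1360, 1432, 2])  # row 1164 of the test data

exact = np.dot(full, features) + intercept
approx = np.dot(rounded, features) + intercept

print(abs(exact - approx))  # well under a cent
```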
Now that we know the co-efficients for each of the features and the intercept value we can write a very simple function that could easily be ported to Excel VBA, C#.NET or any other programming language.
This function is deceptively simple because it is the culmination of the entire data science pipeline of activities consolidated into one simple function, but in a real world data-science project there will have been a lot of work invested in getting to this point.
Please note that if this were a real-world example I would periodically re-run the data science pipeline on current data to ensure that my chosen features, their co-efficients and the intercept still represent the real world well and are producing predictions that are sufficiently accurate to use; if not, I would update the co-efficient and intercept values.
Here is the final function, written in Python, but easily ported into any other language -
def implemented_house_price_algorithm(OverallQual, TotalBsmtSF, GrLivArea, GarageCars):
    house_price = -100455.637289                              # Intercept
    house_price = house_price + (24408.265330 * OverallQual)  # Co-efficient for OverallQual
    house_price = house_price + (28.048856 * TotalBsmtSF)     # Co-efficient for TotalBsmtSF
    house_price = house_price + (45.672835 * GrLivArea)       # Co-efficient for GrLivArea
    house_price = house_price + (18806.457188 * GarageCars)   # Co-efficient for GarageCars
    return house_price
price = implemented_house_price_algorithm(5, 1360, 1432, 2)
print(f"${price:.2f}")
$162748.55
price = implemented_house_price_algorithm(7, 912, 1426, 1)
print(f"${price:.2f}")
$179918.70
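Before relying on a hand-ported function like this in production, it is worth validating it against the original model across a whole data set rather than a couple of rows. Here is a minimal sketch of that validation, using synthetic data in place of the housing set (the names fitted and ported_predict are illustrative, not from the article's code) -

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the housing data: 4 features, known relationship
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(50, 4))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + 7.0

fitted = LinearRegression().fit(X, y)

def ported_predict(a, b, c, d):
    # Stand-in for the hand-ported production function, built directly
    # from the fitted co-efficients and intercept
    return (fitted.intercept_
            + fitted.coef_[0] * a
            + fitted.coef_[1] * b
            + fitted.coef_[2] * c
            + fitted.coef_[3] * d)

# Compare the ported function with the model over every row
manual = np.array([ported_predict(*row) for row in X])
max_err = np.max(np.abs(manual - fitted.predict(X)))
print(f"max absolute discrepancy: {max_err:.2e}")
```

In a real deployment the same check would be run against the production implementation (Excel VBA, C#.NET etc.) using an exported sample of predictions.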